{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "97533445-2614-467a-9294-a823edd7eea7",
   "metadata": {},
   "source": [
    "### Interactive Visualization of Chemical Space\n",
    "In many situtations, we want to be able quickly visualize the chemical space occupied by a set of compounds.  In this space, chemically similar compounds will be close together and dissimilar compounds will be farther apart.  This notebook provides a brief example of how to create an interactive plot where the chemical structures of compounds corresponding to selected points are shown below the plot. \n",
    "\n",
    "**Important Note if You're Running in Colab**   \n",
    "After the libraries are installed, you'll see a message saying \"Your session has crashed, automatically restarting\".  Don't worry about this.  We're simply forcing Colab to restart and pick up the newly installed libraries. Continue to execute the notebook cells, everything will work. "
   ]
  },
  {
   "cell_type": "code",
   "id": "ac55185f-d2da-4d14-85b1-6317ccf4d16d",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T22:09:09.770385Z",
     "start_time": "2025-05-05T22:09:09.768302Z"
    }
   },
   "source": [
    "%%capture\n",
    "import sys\n",
    "IN_COLAB = 'google.colab' in sys.modules\n",
    "if IN_COLAB:\n",
    "    !pip install useful_rdkit_utils jupyter-scatter mols2grid scikit-learn\n",
    "    exit()"
   ],
   "outputs": [],
   "execution_count": 1
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "106eafa6-34c0-4a70-94df-f79196ca259c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import useful_rdkit_utils as uru\n",
    "from sklearn.manifold import TSNE\n",
    "from sklearn.decomposition import PCA\n",
    "import numpy as np\n",
    "import jscatter\n",
    "import mols2grid\n",
    "import ipywidgets\n",
    "import warnings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac459a03-2d84-4e1e-a84c-35182f09d001",
   "metadata": {},
   "source": [
    "#### 1. Read the input data\n",
    "Read a dataset with drugs from the [ChEMBL](https://www.ebi.ac.uk/chembl/) database. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f33c9e40-abda-4e1e-a0db-80531fa59b46",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://raw.githubusercontent.com/PatWalters/datafiles/refs/heads/main/chembl_drugs.smi\"\n",
    "df = pd.read_csv(url,sep=\" \",names=[\"SMILES\",\"Name\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac36dc49-51a8-485f-9409-ce816ba780a5",
   "metadata": {},
   "source": [
    "#### 2. Generate chemical fingerprints\n",
    "Instantiate a fingerprint generator object from [useful_rdkit_utils](https://github.com/PatWalters/useful_rdkit_utils). This is just a convenience wrapper around the [RDKit Morgan fingerprint generator](https://greglandrum.github.io/rdkit-blog/posts/2023-01-18-fingerprint-generator-tutorial.html). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f46212b3-6ac6-460f-93ee-4fc8199c9c2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "smi2fp = uru.Smi2Fp()\n",
    "df['fp'] = df.SMILES.apply(smi2fp.get_np)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2795e605-25fd-4cd1-85b9-5bda658f8c14",
   "metadata": {},
   "source": [
    "#### 3. Reduce the fingerprint dimensionality with PCA\n",
    "We are going to use Truncated Stochasitc Neighbor Embedding (TSNE) to project the chemical fingerprints generated above into two dimensions. TSNE works better when the dimensionality of the input data has been reduced to ~50 features.  We will use Principal Component Analysis (PCA) to reduce the fingerprints to 50 dimensions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "afb7813f-700e-4a81-9ee5-f1ad5c76bb54",
   "metadata": {},
   "outputs": [],
   "source": [
    "pca = PCA(n_components=50)\n",
    "pcs = pca.fit_transform(np.stack(df.fp))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bedf9266-4027-4b99-bc47-8a782ecc7205",
   "metadata": {},
   "source": [
    "#### 4. Project the PCs into two dimensions with TSNE\n",
    "Now we can reduce the 50 dimensional principal components to 2 dimensions for plotting. Note that I used a context manager to catch a few annoying warning messages. The coordinates from the TSNE projection are added to the dataframe as **tsne_x** and **tsne_y**. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "3ebccd32-7cfc-43d6-ac2e-ddda37a61188",
   "metadata": {},
   "outputs": [],
   "source": [
    "with warnings.catch_warnings():\n",
    "    warnings.filterwarnings(\"ignore\",category=FutureWarning)\n",
    "    tsne = TSNE(n_components=2,init='pca')\n",
    "    df[[\"tsne_x\",\"tsne_y\"]] = tsne.fit_transform(pcs).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8923a3e-10cb-4fca-906d-bd82a7082c6d",
   "metadata": {},
   "source": [
    "#### 5. Generate an interactive scatterplot\n",
    "That's all we need to do. Now we can make a plot of chemical space using the nifty [Jupyter Scatter](https://github.com/flekschas/jupyter-scatter) component.  You can control this component using the icons on the left of the plot.  Click on the second icon from the top to put the compoent into selection mode.  Click and drag to select a set of points, and the corresponding chemical structures will be shown below the plot.  The third icon from the top can be used to change the selection mode.  For efficiency, I've limited the display to 25 chemical structures.  This can be easily changed in the code block below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "78afc51d-078e-42d2-abee-8a05cfc48d62",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "88ee0b33fb6247e48fcd220e5c5b29cb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VBox(children=(VBox(children=(HBox(children=(VBox(children=(Button(icon='arrows', style='primary', width=36), …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "scatter = jscatter.Scatter(data=df,x=\"tsne_x\", y=\"tsne_y\")\n",
    "output = ipywidgets.Output()\n",
    "\n",
    "@output.capture(clear_output=True)\n",
    "def selection_change_handler(change):\n",
    "    display(mols2grid.display(df.loc[change.new].head(25),subset=[\"img\",\"Name\"],template=\"static\",prerender=True,size=(200,200)))\n",
    "            \n",
    "scatter.widget.observe(selection_change_handler, names=[\"selection\"])\n",
    "\n",
    "ipywidgets.VBox([scatter.show(), output])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22d6dea6-d61a-4fec-8d1c-ab019ae300ec",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
